Bioinformatics A Practical Guide to Next Generation Sequencing Data Analysis (Hamid D. Ismail)

There are several programs for quality assessment, but FastQC is the most popular one.
FastQC is a user-friendly program that assesses the quality of the reads generated by any
sequencing technology and produces a report that summarizes the results in graphs that
are easy to interpret. Potential quality problems include low-quality bases, adaptor
sequences attached to the reads, adaptor dimers or other contaminating technical
sequences, and overrepresented PCR sequences. The report also covers sequence length
distribution, per base sequence content, per sequence GC content, per base N content, and
k-mer content. The per base sequence quality and adaptor content are the most important
metrics to examine and act on. Ideal sequencing data have no warnings or failed metrics,
so we should try to fix as many problems as possible. However, some problems cannot be
solved. If an unsolved problem does not affect the reads severely, the data can still be
used in the analysis, but we must be aware that unsolved problems may have a negative
impact on the results. Read quality problems can be addressed, depending on which metrics
failed, by removing low-quality reads, trimming bases from the beginning and the end of
the reads, and masking bases with low quality scores. Several programs are available for
processing raw sequence
data. FASTX-Toolkit is the most popular one for single-end FASTQ files, and Trimmomatic
is more sophisticated and can be used for both single-end and paired-end raw data. Fastp
filters low-quality reads and automatically recognizes and trims adaptor sequences. It is
important to process paired-end FASTQ files (forward and reverse) together to avoid
leaving singletons (reads whose mates were discarded), which almost all aligners may
reject. In this chapter, we
discussed command-line programs for quality control. These programs, or similar ones,
are also implemented in Python, R, and other programming languages, but the general
principles for checking raw data quality and solving potential quality problems are the
same. Most sequencing applications use these kinds of QC processing, but when we cover
metagenomic data analysis, you will learn how to preprocess microbial raw data using
different programs. Once the raw sequencing data are cleaned, we can move safely to the
next step of sequence data analysis, which depends on the application workflow that we
are adopting.
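
To make the per base sequence quality metric concrete, the following minimal Python sketch computes the mean Phred score at each read position, which is roughly what the corresponding FastQC graph summarizes. It is an illustration only, not FastQC's implementation. It assumes four-line FASTQ records and Phred+33 quality encoding, and the function names (read_fastq, phred_scores, per_base_mean_quality) and the two example reads are hypothetical.

```python
import io
from collections import defaultdict


def read_fastq(handle):
    """Yield (name, sequence, quality_string) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header[1:], seq, qual


def phred_scores(qual, offset=33):
    """Decode a quality string into integer Phred scores (Phred+33 is assumed)."""
    return [ord(ch) - offset for ch in qual]


def per_base_mean_quality(records):
    """Return the mean Phred score at each position across all reads."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for _name, _seq, qual in records:
        for pos, score in enumerate(phred_scores(qual)):
            totals[pos] += score
            counts[pos] += 1
    return [totals[pos] / counts[pos] for pos in sorted(counts)]


# Two made-up reads: 'I' decodes to Q40, '5' to Q20, and '#' to Q2.
EXAMPLE_FASTQ = "@read1\nACGT\n+\nIIII\n@read2\nACG\n+\nI5#\n"
means = per_base_mean_quality(read_fastq(io.StringIO(EXAMPLE_FASTQ)))
print(means)  # [40.0, 30.0, 21.0, 40.0]
```

A drop in the mean toward the end of the reads, as at the third position here, is the kind of pattern that would suggest trimming.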

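The three remedies named above (removing low-quality reads, trimming read ends, and masking low-quality bases) and the advice to process paired-end files together can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the algorithm used by FASTX-Toolkit, Trimmomatic, or fastp, which apply more elaborate strategies such as sliding windows. The thresholds (Q20, minimum length 3), the function names, and the example reads are all hypothetical, and Phred+33 encoding is assumed.

```python
def phred_scores(qual, offset=33):
    """Decode a quality string into integer Phred scores (Phred+33 is assumed)."""
    return [ord(ch) - offset for ch in qual]


def trim_ends(seq, qual, min_q=20):
    """Remove bases scoring below min_q from both ends of a read."""
    scores = phred_scores(qual)
    start, end = 0, len(scores)
    while start < end and scores[start] < min_q:
        start += 1
    while end > start and scores[end - 1] < min_q:
        end -= 1
    return seq[start:end], qual[start:end]


def mask_low_quality(seq, qual, min_q=20):
    """Replace bases scoring below min_q with N, keeping the read length."""
    return "".join(b if q >= min_q else "N" for b, q in zip(seq, phred_scores(qual)))


def passes(seq, qual, min_len=3, min_mean_q=20):
    """Accept a read only if it is long enough and its mean quality is high enough."""
    if not seq or len(seq) < min_len:
        return False
    scores = phred_scores(qual)
    return sum(scores) / len(scores) >= min_mean_q


def filter_pairs(forward, reverse):
    """Trim both mates and keep a pair only if both pass, so no singletons remain."""
    kept_fwd, kept_rev = [], []
    for (name1, seq1, qual1), (name2, seq2, qual2) in zip(forward, reverse):
        trimmed1 = trim_ends(seq1, qual1)
        trimmed2 = trim_ends(seq2, qual2)
        if passes(*trimmed1) and passes(*trimmed2):
            kept_fwd.append((name1, *trimmed1))
            kept_rev.append((name2, *trimmed2))
    return kept_fwd, kept_rev


# Made-up reads in matching order: 'I' decodes to Q40 and '#' to Q2.
forward = [("r1/1", "ACGTAC", "#IIII#"), ("r2/1", "ACGTAC", "IIIIII")]
reverse = [("r1/2", "TTGCAA", "IIIIII"), ("r2/2", "TTGCAA", "####I#")]

kept_fwd, kept_rev = filter_pairs(forward, reverse)
print(kept_fwd)  # [('r1/1', 'CGTA', 'IIII')]
print(kept_rev)  # [('r1/2', 'TTGCAA', 'IIIIII')]
print(mask_low_quality("ACGT", "I#II"))  # ANGT
```

The second pair is dropped as a whole even though its forward read is of high quality, because its reverse mate is too short after trimming. Filtering each file separately would have kept that forward read as a singleton and left the two outputs out of sync.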